The Entity Linking & Disambiguation subsystem in KAZU identifies entities in text, links them to external knowledge bases, and resolves ambiguity between candidate links. It works in three stages: dictionary-based linking, rule-based filtering, and a post-processing phase that applies mapping and disambiguation strategies. The subsystem uses TF-IDF to score context and manages cross-references between ontologies, so that each entity annotation is precise and fits its context.
Components
Entity Disambiguation
This component is responsible for disambiguating entity classes using a TF-IDF scoring mechanism. It builds TF-IDF documents from spans and scores entity contexts to determine the most appropriate entity class.
Dictionary Linking
This component performs entity linking based on a dictionary lookup. It utilizes a caching mechanism to store and retrieve entity linking candidates efficiently.
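A cached dictionary lookup of this kind can be sketched as follows. This is a minimal illustration, not KAZU's implementation: the `SYNONYMS` table, the `EX:` identifiers, the `LOOKUPS` counter, and the `link_candidates` function are all hypothetical, and the standard library's `functools.lru_cache` stands in for whatever cache KAZU uses.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical toy dictionary: normalised synonym -> candidate identifiers.
SYNONYMS: Dict[str, Tuple[str, ...]] = {
    "egfr": ("EX:0001",),
    "asthma": ("EX:0002",),
}

# Counts lookups that actually reach the dictionary (illustration only).
LOOKUPS = {"count": 0}


@lru_cache(maxsize=1024)
def link_candidates(mention: str) -> Tuple[str, ...]:
    """Return candidate identifiers for a mention, caching repeated queries."""
    LOOKUPS["count"] += 1
    return SYNONYMS.get(mention.strip().lower(), ())
```

The cache is keyed on the raw mention string, so "EGFR" and "egfr" occupy separate cache entries even though they resolve to the same candidates. Returning a tuple rather than a list keeps the cached value immutable.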
Rules-Based Filtering
This component applies a set of predefined rules to filter and disambiguate entity classes. It uses spaCy matchers to identify true positive and false positive matches based on custom extensions and mention patterns.
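The idea of true positive and false positive rules can be illustrated without spaCy. The sketch below swaps the spaCy matchers for plain regular expressions, and the `RULES` table, its patterns, and the `keep_mention` function are hypothetical. It also assumes that a false positive match overrides a true positive one, which the description above does not state.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical rules per entity class. "tp" patterns support a mention,
# "fp" patterns veto it.
RULES: Dict[str, Dict[str, List[Pattern[str]]]] = {
    "gene": {
        "tp": [re.compile(r"\b(gene|expression|mutation)\b", re.I)],
        "fp": [re.compile(r"\b(figure|table)\b", re.I)],
    }
}


def keep_mention(entity_class: str, sentence: str) -> bool:
    """Decide whether a mention of `entity_class` in `sentence` is kept."""
    rules = RULES.get(entity_class)
    if rules is None:
        return True  # no rules configured for this class: leave it untouched
    if any(p.search(sentence) for p in rules["fp"]):
        return False  # assumed precedence: a false positive match wins
    return any(p.search(sentence) for p in rules["tp"])
```

With these toy rules, a gene mention in "EGFR expression was elevated" is kept, while one in "See Figure 2 for EGFR expression" is dropped.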
Cross-Reference Management
This component manages cross-references between different knowledge bases or ontologies. It can build a cache of cross-references, including fetching data from external services such as OxO.
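A cross-reference cache can be pictured as a bidirectional index over pairs of equivalent identifiers. The sketch below is an assumption about the general shape, not KAZU's code: `build_xref_cache`, `cross_references`, and the `SRC:`/`TGT:` identifiers are hypothetical, and the pairs are hard-coded where a real system might fetch them from an external service.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_xref_cache(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Index (id_a, id_b) equivalence pairs in both directions."""
    cache: Dict[str, Set[str]] = defaultdict(set)
    for id_a, id_b in pairs:
        cache[id_a].add(id_b)
        cache[id_b].add(id_a)
    return dict(cache)


def cross_references(
    cache: Dict[str, Set[str]], identifier: str, target_prefix: str
) -> List[str]:
    """Return the equivalents of `identifier` that live in the target ontology."""
    prefix = target_prefix + ":"
    return sorted(x for x in cache.get(identifier, ()) if x.startswith(prefix))


# Hypothetical identifier pairs standing in for data from an external service.
xref_cache = build_xref_cache([("SRC:1", "TGT:9"), ("SRC:1", "OTHER:4")])
```

Indexing both directions means a lookup works regardless of which ontology the query identifier comes from.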
Mapping Strategies
This component defines and implements various strategies for mapping entities to their canonical identifiers. It includes a factory for creating mapping objects and different matching strategies like symbol match and strong match.
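One way to picture the relationship between a mapping factory and its strategies is sketched below. The names `Candidate`, `Mapping`, `make_mapping`, `symbol_match`, and `strong_match` are hypothetical, and so are the semantics: here a symbol match is taken to mean an exact, case-sensitive label match, and a strong match is taken to mean a similarity score at or above a threshold. KAZU's actual definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    identifier: str
    label: str
    score: float  # similarity to the mention, assumed to lie in [0, 1]


@dataclass(frozen=True)
class Mapping:
    identifier: str
    strategy: str
    score: float


def make_mapping(candidate: Candidate, strategy: str) -> Mapping:
    """Factory: build a mapping that records which strategy produced it."""
    return Mapping(candidate.identifier, strategy, candidate.score)


def symbol_match(mention: str, candidates: List[Candidate]) -> List[Mapping]:
    """Assumed semantics: keep candidates whose label equals the mention exactly."""
    return [make_mapping(c, "symbol_match") for c in candidates if c.label == mention]


def strong_match(
    mention: str, candidates: List[Candidate], threshold: float = 0.9
) -> List[Mapping]:
    """Assumed semantics: keep candidates scoring at or above the threshold."""
    return [make_mapping(c, "strong_match") for c in candidates if c.score >= threshold]
```

Routing all construction through one factory function keeps provenance (the strategy name) attached to every mapping, which helps when several strategies are tried in turn.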
Post-Processing Disambiguation Strategies
This component provides a collection of strategies for disambiguating entities after initial linking. These strategies leverage various contextual cues, including TF-IDF scores, document context, and default labels.
Disambiguation Context Scoring
This component is dedicated to generating and scoring context representations for disambiguation. It uses TF-IDF models and n-gram generation to create vector representations of text for similarity calculations.
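The combination of n-gram generation, TF-IDF weighting, and similarity scoring can be sketched in plain Python. This is an illustration of the general technique only: the function names, the choice of character trigrams, the smoothed IDF formula, and the toy context strings are all assumptions rather than details taken from KAZU.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Lower-case the text, pad it with spaces, and slide a window of size n."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(docs: List[str], n: int = 3) -> Dict[str, float]:
    """Smoothed inverse document frequency for each n-gram seen in `docs`."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(docs)
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def vectorise(text: str, idf: Dict[str, float], n: int = 3) -> Dict[str, float]:
    """Sparse TF-IDF vector; n-grams unseen during fitting are dropped."""
    tf = Counter(g for g in char_ngrams(text, n) if g in idf)
    return {g: count * idf[g] for g, count in tf.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical context strings, one per entity class.
CONTEXTS = {
    "gene": "gene expression mutation protein",
    "disease": "patient symptoms chronic disease",
}
IDF = fit_idf(list(CONTEXTS.values()))


def best_class(query: str) -> str:
    """Pick the entity class whose context vector is most similar to the query."""
    q = vectorise(query, IDF)
    return max(CONTEXTS, key=lambda cls: cosine(q, vectorise(CONTEXTS[cls], IDF)))
```

Character n-grams make the comparison tolerant of inflection and minor spelling variation, at the cost of larger vectors than a word-level vocabulary would give.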
Strategy Execution Runner
This component orchestrates the post-processing strategies, both mapping and disambiguation. It groups entities and chooses which strategies to apply based on match confidence and on whether a mention is symbolic.
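The orchestration described here can be sketched as a small runner. Everything in the sketch is hypothetical: the `Entity` fields, the string confidence labels, the grouping key, and in particular the assumption that strategies are tried in order and that the first one to return any mappings ends the search for that group.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Entity:
    text: str
    entity_class: str
    confidence: str  # hypothetical labels such as "high" or "low"
    symbolic: bool
    mappings: List[str] = field(default_factory=list)


Strategy = Callable[[Entity], List[str]]
StrategyTable = Dict[Tuple[bool, str], List[Strategy]]


def run_strategies(entities: List[Entity], strategies: StrategyTable) -> None:
    """Group equivalent entities, then apply the first strategy that succeeds."""
    groups: Dict[tuple, List[Entity]] = defaultdict(list)
    for ent in entities:
        key = (ent.text, ent.entity_class, ent.symbolic, ent.confidence)
        groups[key].append(ent)

    for group in groups.values():
        representative = group[0]
        table_key = (representative.symbolic, representative.confidence)
        for strategy in strategies.get(table_key, []):
            found = strategy(representative)
            if found:
                # Share the result with every entity in the group, then stop.
                for ent in group:
                    ent.mappings = list(found)
                break
```

Grouping first means each strategy runs once per distinct mention rather than once per occurrence, which matters when the same term appears many times in a document.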
Mapping Step
This component represents a processing step within the KAZU pipeline specifically designed for applying mapping strategies to entities. It inherits from a general parser-dependent step.
KAZU Utilities
This component provides a collection of utility functions used across the KAZU system, including caching, grouping, spaCy object mapping, string normalization, and n-gram generation.
KAZU Data & DB
This component encompasses core data structures and in-memory database interactions within the KAZU system, including mappings, equivalent ID sets, and metadata/synonym databases.